{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "QqIu96YO9lb8"
   },
   "source": [
    "# Inference"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "_bJFitQy9lcG"
   },
   "source": [
    "## An inferential problem: Wage Gap based on Sex"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "CnD-7-vI9lcH"
   },
   "source": [
    "In the previous lab, we already analyzed data from the March Supplement of the U.S. Current Population Survey (2015) and answered the question how to use job-relevant characteristics, such as education and experience, to best predict wages. Now, we focus on the following inference question:\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "BCOUK20N9lcI"
   },
   "source": [
    "**What is the difference in predicted wages between male and female workers with the same job-relevant characteristics?**\n",
    "\n",
    "Thus, we analyze if there is a difference in the payment of male and female workers (wage gap). The wage gap may partly reflect discrimination against female workers in the labor market or may partly reflect a selection effect, namely that female workers are relatively more likely to take on occupations that pay somewhat less (for example, school teaching).\n",
    "\n",
    "To investigate the wage gap, we consider the following log-linear regression model\n",
    "\n",
    "\\begin{align} \\log(Y) &= \\beta'X + \\epsilon\\\\ &= \\beta_1 D + \\beta_2' W + \\epsilon, \\end{align}\n",
    "where $D$ is the indicator of being female ($1$ if female and $0$ otherwise) and the $W$'s are controls explaining variation in wages. Considering transformed wages by the logarithm, we are analyzing the relative difference in the payment of male and female workers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "Zjm2h8rt9lcJ"
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import statsmodels.api as sm\n",
    "import statsmodels.formula.api as smf\n",
    "from sklearn.model_selection import train_test_split\n",
    "import sys\n",
    "from sklearn.base import BaseEstimator\n",
    "import warnings\n",
    "# ignore potential convergence warnings; for some small\n",
    "# penalty levels, tried out, optimization might not converge\n",
    "warnings.simplefilter('ignore')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Q2Z_SC7a9lcK"
   },
   "source": [
    "## Data Analysis\n",
    "\n",
    "We consider the same subsample of the U.S. Current Population Survey (2015) as in the previous lab. Let us load the data set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "LYwPvICz9lcK"
   },
   "outputs": [],
   "source": [
    "file = (\"https://raw.githubusercontent.com/CausalAIBook/\"\n",
    "        \"MetricsMLNotebooks/main/data/wage2015_subsample_inference.csv\")\n",
    "df = pd.read_csv(file)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "9UpPVZnh9lcL"
   },
   "source": [
    "To start our analysis, we compare the sample means for the different sexes:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 226
    },
    "id": "fw1c4JHm9lcL",
    "outputId": "0f677a4f-17ae-4120-f189-2c8334e31506"
   },
   "outputs": [],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 426
    },
    "id": "gNhXyu279lcM",
    "outputId": "c1aa98d2-c341-479e-a09c-32d11f49dab3"
   },
   "outputs": [],
   "source": [
    "table = pd.DataFrame()\n",
    "\n",
    "cols = [\"lwage\", \"sex\", \"shs\", \"hsg\", \"scl\",\n",
    "        \"clg\", \"ad\", \"ne\", \"mw\", \"so\", \"we\", \"exp1\"]\n",
    "\n",
    "table['Variable'] = [\"Log Wage\", \"Sex\", \"Less then High School\",\n",
    "                     \"High School Graduate\", \"Some College\",\n",
    "                     \"Gollage Graduate\", \"Advanced Degree\",\n",
    "                     \"Northeast\", \"Midwest\", \"South\", \"West\", \"Experience\"]\n",
    "\n",
    "table['All'] = df[cols].mean().values\n",
    "\n",
    "table['Male'] = df[df['sex'] == 0][cols].mean().values\n",
    "table['Female'] = df[df['sex'] == 1][cols].mean().values\n",
    "\n",
    "\n",
    "# Show results\n",
    "table"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 320
    },
    "id": "o9K4z3T59lcN",
    "outputId": "18d93e64-ac44-4606-fc0a-db0fdb459667"
   },
   "outputs": [],
   "source": [
    "df.describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "HX98B_G19lcN",
    "outputId": "f54f8bfe-fe57-4787-a048-cc61aadffccb"
   },
   "outputs": [],
   "source": [
    "# print to Latex\n",
    "print(table.style.to_latex())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Q5wPwQk39lcO"
   },
   "source": [
    "In particular, the table above shows that the difference in average logwage between male and female workers is equal to  0,038."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "RGPNaEUC9lcO",
    "outputId": "d0ebcf0c-fff7-47e4-bc7f-ea162d8e76b8"
   },
   "outputs": [],
   "source": [
    "# compute difference\n",
    "table.loc[0, 'Female'] - table.loc[0, 'Male']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "h9Ax06WW9lcP"
   },
   "source": [
    "Thus, the unconditional wage gap is about  $3.8\\%$ for the group of never married workers (female workers get paid less on average in our sample). We also observe that never married female workers are relatively more educated than male workers and have lower working experience.\n",
    "\n",
    "This unconditional (predictive) effect of sex equals the coefficient  $\\beta$  in the univariate ols regression of  $Y$  on  $D$ :\n",
    "\n",
    "$$\n",
    "log(Y)=\\beta D+\\epsilon.\n",
    "$$\n",
    "\n",
    "We verify this by running an ols regression."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "Kn6HiZnY9lcP",
    "outputId": "f54561d6-315c-48e5-9274-ad2a319fdf5b"
   },
   "outputs": [],
   "source": [
    "nocontrol_fit = smf.ols(\"lwage ~ sex\", data=df).fit()\n",
    "nocontrol_est = nocontrol_fit.params['sex']\n",
    "nocontrol_se = nocontrol_fit.HC3_se['sex']\n",
    "\n",
    "print(f\"The estimated sex coefficient is {nocontrol_est:.4f} \"\n",
    "      f\"and the corresponding robust standard error is {nocontrol_se:.4f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "7fOUkoYS9lcQ"
   },
   "source": [
    "Note that the standard error is computed to be robust to heteroskedasticity.\n",
    "\n",
    "Next, we run an ols regression of  $Y$  on  $(D,W)$  to control for the effect of covariates summarized in  $W$ :\n",
    "\n",
    "$$\n",
    "log(Y)=\\beta_1 D + \\beta_2' W + \\epsilon.\n",
    "$$\n",
    "\n",
    "Here, we are considering the flexible model from the previous lab. Hence,  $W$ controls for experience, education, region, and occupation and industry indicators plus transformations and two-way interactions.\n",
    "\n",
    "Let us run the ols regression with controls."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 1000
    },
    "id": "DvaliSU29lcQ",
    "outputId": "fdbe8b22-50db-4a42-d263-0384adf68ec2"
   },
   "outputs": [],
   "source": [
    "# Ols regression with controls\n",
    "\n",
    "flex = \"lwage ~ sex + (exp1+exp2+exp3+exp4)*(shs+hsg+scl+clg+C(occ2)+C(ind2)+mw+so+we)\"\n",
    "\n",
    "# Note that ()*() operation in formula objects in statsmodels\n",
    "# creates a formula of the sort:\n",
    "#   (exp1+exp2+exp3+exp4)+ (shs+hsg+scl+clg+occ2+ind2+mw+so+we)\n",
    "#   + (exp1+exp2+exp3+exp4)*(shs+hsg+scl+clg+occ2+ind2+mw+so+we)\n",
    "# This is not intuitive at all, but that's what it does.\n",
    "\n",
    "control_fit = smf.ols(flex, data=df).fit()\n",
    "control_est = control_fit.params['sex']\n",
    "control_se = control_fit.HC3_se['sex']\n",
    "\n",
    "control_fit.summary()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "JFtJiI1K9lcR",
    "outputId": "67414bcb-638d-4341-c8ed-8c8643a2f89e"
   },
   "outputs": [],
   "source": [
    "print(f\"The estimated sex coefficient is {control_est:.4f} \"\n",
    "      f\"and the corresponding robust standard error is {control_se:.4f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "zgH-fOfc9lcR"
   },
   "source": [
    "The estimated regression coefficient  $\\beta_1≈-0.06955$  measures how our linear prediction of wage changes if we set the sex variable  $D$  from 0 to 1, holding the controls  $W$  fixed. We can call this the predictive effect (PE), as it measures the impact of a variable on the prediction we make. Overall, we see that the unconditional wage gap of size  4 \\% for female workers increases to about  7 \\% after controlling for worker characteristics."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We now show how the conditional gap and the remainder decompose the marginal wage gap into the parts explained and unexplained by the additional controls. (Note that this does *not* explain why there is a difference in the controls to begin with in the two groups.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "flex_without_sex = \"lwage ~ (exp1+exp2+exp3+exp4)*(shs+hsg+scl+clg+C(occ2)+C(ind2)+mw+so+we)\"\n",
    "lm0 = smf.ols(flex_without_sex, data=df[df[\"sex\"] == 0])\n",
    "lm1 = smf.ols(flex_without_sex, data=df[df[\"sex\"] == 1])\n",
    "XX0 = lm0.exog\n",
    "y0 = lm0.endog\n",
    "XX1 = lm1.exog\n",
    "y1 = lm1.endog\n",
    "# the coefficients excluding intercept and \"sex\"\n",
    "betarest = control_fit.params[control_fit.params.index != \"sex\"][1:]\n",
    "\n",
    "print(\"The marginal gap:\", y1.mean() - y0.mean())\n",
    "diff_unexplained = control_est\n",
    "print(\"The unexplained difference: \", diff_unexplained)\n",
    "diff_explained = betarest.dot(XX1.mean(0)[1:] - XX0.mean(0)[1:])\n",
    "print(\"The explained difference:\", diff_explained)\n",
    "print(\"The sum of these differences:\", diff_unexplained + diff_explained)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We next consider a Oaxaca-Blinder decomposition that also incorporates an interaction term."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "u1, s1, v1 = np.linalg.svd(XX1)\n",
    "u0, s0, v0 = np.linalg.svd(XX0)\n",
    "s1[s1 <= 1e-10] = 0\n",
    "s1[s1 > 1e-10] = 1 / s1[s1 > 1e-10]\n",
    "beta1 = (s1 * s1 * v1.T).dot(v1).dot(XX1.T).dot(y1)\n",
    "s0[s0 <= 1e-10] = 0\n",
    "s0[s0 > 1e-10] = 1 / s0[s0 > 1e-10]\n",
    "beta0 = (s0 * s0 * v0.T).dot(v0).dot(XX0.T).dot(y0)\n",
    "\n",
    "print(\"The marginal gap: \", y1.mean() - y0.mean())\n",
    "diff_unexplained = beta1[0] - beta0[0]\n",
    "print(\"The unexplained difference: \", diff_unexplained)\n",
    "diff_endow = beta0.dot(XX1.mean(0) - XX0.mean(0))\n",
    "print(\"The difference explained by endowment: \", diff_endow)\n",
    "diff_coeff = (beta1[1:] - beta0[1:]).dot(XX1.mean(0)[1:])\n",
    "print(\"The difference explained by coefficient: \", diff_coeff)\n",
    "print(\"The sum of these differences: \", diff_unexplained + diff_endow + diff_coeff)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "l25upD8N9lcS"
   },
   "source": [
    "Next, we are using the Frisch-Waugh-Lovell theorem from the lecture partialling-out the linear effect of the controls via ols."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "dz9XPQTD9lcS",
    "outputId": "01281b4d-3a09-43f5-e16b-909cdcc87a6a"
   },
   "outputs": [],
   "source": [
    "# Partialling-Out using ols\n",
    "\n",
    "# models\n",
    "flex_y = \"lwage ~  (exp1+exp2+exp3+exp4)*(shs+hsg+scl+clg+C(occ2)+C(ind2)+mw+so+we)\"  # model for Y\n",
    "flex_d = \"sex ~ (exp1+exp2+exp3+exp4)*(shs+hsg+scl+clg+C(occ2)+C(ind2)+mw+so+we)\"  # model for D\n",
    "\n",
    "# partialling-out the linear effect of W from Y\n",
    "t_Y = smf.ols(flex_y, data=df).fit().resid\n",
    "# partialling-out the linear effect of W from D\n",
    "t_D = smf.ols(flex_d, data=df).fit().resid\n",
    "\n",
    "# regression of Y on D after partialling-out the effect of W\n",
    "partial_fit = sm.OLS(t_Y, t_D).fit()\n",
    "partial_est = partial_fit.params['x1']\n",
    "\n",
    "print(\"Coefficient for D via partialling-out \" + str(partial_est))\n",
    "\n",
    "# standard error\n",
    "partial_se = partial_fit.HC3_se['x1']\n",
    "\n",
    "# confidence interval\n",
    "print(\"95% CI: \" + str(partial_fit.conf_int().values[0]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "nLqitZ-jy-S-",
    "outputId": "6460d683-58c4-4f10-cc42-829b229ee06a"
   },
   "outputs": [],
   "source": [
    "print(f\"The estimated sex coefficient is {partial_est:.4f} \"\n",
    "      f\"and the corresponding robust standard error is {partial_se:.4f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "f_NwUdf09lcS"
   },
   "source": [
    "Again, the estimated coefficient measures the linear predictive effect (PE) of  $D$  on  $Y$  after taking out the linear effect of  $W$  on both of these variables. This coefficient equals the estimated coefficient from the ols regression with controls."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "iskkYD3W9lcS"
   },
   "source": [
    "We know that the partialling-out approach works well when the dimension of $W$ is low in relation to the sample size  $n$ . When the dimension of  $W$  is relatively high, we need to use variable selection or penalization for regularization purposes.\n",
    "\n",
    "\n",
    "In the following, we illustrate the partialling-out approach using lasso instead of ols."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Ok2AYuGWEwB9"
   },
   "source": [
    "For Lasso, we use \"plug-in\" tuning with a theoretically valid choice of penalty $\\lambda = 2 \\cdot c \\hat{\\sigma} \\sqrt{n} \\Phi^{-1}(1-\\alpha/2p)$, where $c>1$ and $1-\\alpha$ is a confidence level, and $\\Phi^{-1}$ denotes the quantile function. Under homoskedasticity, this choice ensures that the Lasso predictor is well behaved, delivering good predictive performance under approximate sparsity. In practice, this formula will work well even in the absence of homoskedasticity, especially when the random variables $\\epsilon$ and $X$ in the regression equation decay quickly at the tails.\n",
    "\n",
    "In practice, many people choose to use cross-validation, which is perfectly fine for predictive tasks. However, when conducting inference, to make our analysis valid we will require cross-fitting in addition to cross-validation. As we have not yet discussed cross-fitting, we rely on this theoretically-driven penalty in order to allow for accurate inference in the upcoming notebooks.\n",
    "\n",
    "We pull an analogue of R's rlasso. Rlasso functionality: it is searching the right set of regressors. This function was made for the case of ***p*** regressors and ***n*** observations where ***p >>>> n***. It assumes that the error is i.i.d. The errors may be non-Gaussian or heteroscedastic.\\\n",
    "The post lasso function makes OLS with the selected ***T*** regressors.\n",
    "To select those parameters, they use $\\lambda$ as variable to penalize\\\n",
    "**Funny thing: the function rlasso was named like that because it is the \"rigorous\" Lasso.**\\\n",
    "We find a Python code that tries to replicate the main function of hdm r-package. It was made by [Max Huppertz](https://maxhuppertz.github.io/code/). His library is this [repository](https://github.com/maxhuppertz/hdmpy). If not using colab, download its repository and copy this folder to your site-packages folder. In my case it is located here ***C:\\Python\\Python38\\Lib\\site-packages*** . We need to install this package ***pip install multiprocess***."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "OumtLIGBEoRk",
    "outputId": "04e30117-7e1f-4801-814f-28aee93f5a30"
   },
   "outputs": [],
   "source": [
    "!pip install multiprocess\n",
    "!pip install pyreadr\n",
    "!git clone https://github.com/maxhuppertz/hdmpy.git"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "bTobyc3uFwnh"
   },
   "outputs": [],
   "source": [
    "sys.path.insert(1, \"./hdmpy\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "_MU12XLoFxDr"
   },
   "outputs": [],
   "source": [
    "import hdmpy\n",
    "\n",
    "\n",
    "# We wrap the package so that it has the familiar sklearn API\n",
    "class RLasso(BaseEstimator):\n",
    "\n",
    "    def __init__(self, *, post=True):\n",
    "        self.post = post\n",
    "\n",
    "    def fit(self, X, y):\n",
    "        self.rlasso_ = hdmpy.rlasso(X, y, post=self.post)\n",
    "        return self\n",
    "\n",
    "    def predict(self, X):\n",
    "        pred = np.array(X) @ np.array(self.rlasso_.est['beta']).flatten()\n",
    "        pred += np.array(self.rlasso_.est['intercept'])\n",
    "        return pred\n",
    "\n",
    "    def nsel(self):\n",
    "        return sum(abs(np.array(self.rlasso_.est['beta']).flatten() > 0))\n",
    "\n",
    "\n",
    "def lasso_model():\n",
    "    return RLasso(post=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "N_ZBt97j9lcT",
    "outputId": "0df67cc6-9109-45ad-f0a8-5a7ae781dd63"
   },
   "outputs": [],
   "source": [
    "# Partialling-Out using lasso\n",
    "\n",
    "# models\n",
    "X_d = smf.ols(flex_d, data=df).data.exog[:, 1:]  # exclude intercept so that lasso does not penalize it\n",
    "y_d = smf.ols(flex_d, data=df).data.endog\n",
    "\n",
    "X_y = smf.ols(flex_y, data=df).data.exog[:, 1:]  # exclude intercept so that lasso does not penalize it\n",
    "y_y = smf.ols(flex_y, data=df).data.endog\n",
    "\n",
    "# partialling-out the linear effect of W from Y\n",
    "t_Y_lasso = y_y - lasso_model().fit(X_y, y_y).predict(X_y)\n",
    "\n",
    "# partialling-out the linear effect of W from D\n",
    "t_D_lasso = y_d - lasso_model().fit(X_d, y_d).predict(X_d)\n",
    "\n",
    "# regression of Y on D after partialling-out the effect of W\n",
    "partial_lasso_fit = sm.OLS(t_Y_lasso, t_D_lasso).fit()\n",
    "partial_lasso_est = partial_lasso_fit.params[0]\n",
    "\n",
    "print(\"Coefficient for D via partialling-out using lasso \" + str(partial_lasso_est))\n",
    "\n",
    "# standard error\n",
    "partial_lasso_se = partial_lasso_fit.HC3_se[0]\n",
    "\n",
    "# confidence interval\n",
    "print(\"95% CI: \" + str(partial_lasso_fit.conf_int()[0]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "oRmJu6nm9lcT"
   },
   "source": [
    "Using lasso for partialling-out here provides similar results as using ols."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 175
    },
    "id": "UpbveucD9lcU",
    "outputId": "333fd562-6bfa-4878-e698-4dd4c7c4f7a7"
   },
   "outputs": [],
   "source": [
    "table2 = pd.DataFrame()\n",
    "\n",
    "table2['Model'] = [\"Without controls\", \"full reg\", \"partial reg\", \"partial reg via lasso\"]\n",
    "\n",
    "table2['Estimate'] = [nocontrol_est, control_est, partial_est, partial_lasso_est]\n",
    "\n",
    "table2['Std. Error'] = [nocontrol_se, control_se, partial_se, partial_lasso_se]\n",
    "\n",
    "# Show results\n",
    "table2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "ZQnds_-i9lcU",
    "outputId": "828bf217-ebe3-4212-88e5-602831b491d2"
   },
   "outputs": [],
   "source": [
    "# print to Latex\n",
    "print(table2.style.to_latex())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "jb17Q6AN9lcU"
   },
   "source": [
    "It it worth to notice that controlling for worker characteristics increases the wage gap from less that 4\\% to 7\\%. The controls we used in our analysis include 5 educational attainment indicators (less than high school graduates, high school graduates, some college, college graduate, and advanced degree), 4 region indicators (midwest, south, west, and northeast); a quartic term (first, second, third, and fourth power) in experience and 22 occupation and 23 industry indicators.\n",
    "\n",
    "Keep in mind that the predictive effect (PE) does not only measure discrimination (causal effect of being female), it also may reflect selection effects of unobserved differences in covariates between male and female workers in our sample."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "QoXWF56-c7lV"
   },
   "source": [
    "## OLS Overfitting\n",
    "\n",
    "Next we motivate the usage of lasso. We try an \"extra\" flexible model, where we take interactions of all controls, giving us about 1000 controls. To illustrate how OLS can overfit, we subset to the first 1000 observations so that $p \\approx n$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "tHF782JMtbrt"
   },
   "outputs": [],
   "source": [
    "subset, test = train_test_split(df, train_size=1000, random_state=2724)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "G7L90Bpa9lcV",
    "outputId": "33b884ff-df46-434d-ff9a-75680222d6f8"
   },
   "outputs": [],
   "source": [
    "# model\n",
    "extraflex = \"lwage ~ sex + (exp1+exp2+exp3+exp4+shs+hsg+scl+clg+C(occ2)+C(ind2)+mw+so+we)**2\"\n",
    "\n",
    "control_fit_extra = smf.ols(extraflex, data=subset).fit()\n",
    "control_extra_est = control_fit_extra.params['sex']\n",
    "\n",
    "n = subset.shape[0]\n",
    "p = control_fit_extra.params.shape[0]\n",
    "\n",
    "# Jackknife. Python regularizes in the background. Theory shouldn't really work here.\n",
    "# Cattaneo, Jannson, and Newey (2018) (CJN) do verify that jackknife is valid,\n",
    "# in the sense of being upward biased, in high-dimensional settings under\n",
    "# regularity.\n",
    "control_extra_se = control_fit_extra.HC3_se[\"sex\"]\n",
    "# HC0 will not work here. CJN show it is inconsistent even\n",
    "# in moderate dimensional settings.\n",
    "control_extra_se_HC0 = control_fit_extra.HC0_se[\"sex\"]\n",
    "# We should probably implement Cattaneo, Jannson, and Newey (2018)'s procedure -\n",
    "# though it's not clear their regularity conditions would apply here anyway.\n",
    "\n",
    "print(\"Number of Extra-Flex Controls \" + str(p - 1))\n",
    "\n",
    "print(\"Coefficient for OLS with extra flex controls \" + str(control_extra_est))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "YhHr9Y7Cy-TD",
    "outputId": "cd5dd814-0132-4ed3-fd54-b2873a37e303"
   },
   "outputs": [],
   "source": [
    "# Partialling-Out using lasso\n",
    "\n",
    "# model for Y\n",
    "extraflex_y = \"lwage ~ (exp1+exp2+exp3+exp4+shs+hsg+scl+clg+C(occ2)+C(ind2)+mw+so+we)**2\"\n",
    "# model for D\n",
    "extraflex_d = \"sex ~ (exp1+exp2+exp3+exp4+shs+hsg+scl+clg+C(occ2)+C(ind2)+mw+so+we)**2\"\n",
    "\n",
    "# models\n",
    "\n",
    "# exclude intercept so that lasso does not penalize it\n",
    "X_d = smf.ols(extraflex_d, data=subset).data.exog[:, 1:]\n",
    "y_d = smf.ols(extraflex_d, data=subset).data.endog\n",
    "\n",
    "# exclude intercept so that lasso does not penalize it\n",
    "X_y = smf.ols(extraflex_y, data=subset).data.exog[:, 1:]\n",
    "y_y = smf.ols(extraflex_y, data=subset).data.endog\n",
    "\n",
    "# partialling-out the linear effect of W from Y\n",
    "t_Y_lasso = y_y - lasso_model().fit(X_y, y_y).predict(X_y)\n",
    "\n",
    "# partialling-out the linear effect of W from D\n",
    "t_D_lasso = y_d - lasso_model().fit(X_d, y_d).predict(X_d)\n",
    "\n",
    "# regression of Y on D after partialling-out the effect of W\n",
    "partial_lasso_fit_extra = sm.OLS(t_Y_lasso, t_D_lasso).fit()\n",
    "partial_lasso_est_extra = partial_lasso_fit.params[0]\n",
    "\n",
    "print(\"Coefficient for D via partialling-out using lasso \" + str(partial_lasso_est_extra))\n",
    "\n",
    "# standard error\n",
    "partial_lasso_se_extra = partial_lasso_fit.HC0_se[0]\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 143
    },
    "id": "EyElA7c49lcY",
    "outputId": "c16c851a-4a80-4d2a-a220-7a6620e13e2b"
   },
   "outputs": [],
   "source": [
    "table3 = pd.DataFrame()\n",
    "\n",
    "table3['Model'] = [\"full reg HC0\", \"full reg HC3\", \"partial reg via lasso\"]\n",
    "\n",
    "table3['Estimate'] = [control_extra_est, control_extra_est, partial_lasso_est_extra]\n",
    "\n",
    "table3['Std. Error'] = [control_extra_se_HC0, control_extra_se, partial_lasso_se_extra]\n",
    "\n",
    "# Show results\n",
    "table3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "1NEU_9xj9lcY",
    "outputId": "53c7e1dc-3fcc-469c-80ad-c9f2b5d47a8b"
   },
   "outputs": [],
   "source": [
    "# print to Latex\n",
    "print(table3.style.to_latex())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "cEg8OVKk9lcZ"
   },
   "source": [
    "In this case $p/n \\approx 1$, that is  $p/n$  is no longer small and we start seeing the differences between unregularized partialling out and regularized partialling out with lasso (double lasso). The results based on double lasso have rigorous guarantees in this non-small $p/n$ regime under approximate sparsity. The results based on OLS still have guarantees in $p/n< 1$ regime under assumptions laid out in Cattaneo, Newey, and Jansson (2018), without approximate sparsity, although other regularity conditions are needed."
   ]
  }
 ],
 "metadata": {
  "colab": {
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
